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ABSTRACT 

This paper deals with the computation of the rank and some 
integer Smith forms of a series of sparse matrices arising 
in algebraic K-theory. The number of non zero entries in 
the considered matrices ranges from 8 to 37 millions. The 
largest rank computation took more than 35 days on 50 
processors. We report on the actual algorithms we used to 
build the matrices, their link to the motivic cohomology and 
the linear algebra and parallelizations required to perform 
such huge computations. In particular, these results are 
part of the first computation of the cohomology of the linear 
group GLf(Z). 

1. INTRODUCTION 

1.1 Motivation from K-theory 

Numerous problems in modern number theory could be solved 
or at least better understood, if we have a good knowledge 
on the algebraic Jf-theory (or motivic cohomology) of inte- 
gers of number fields or the cohomology of arithmetic groups 
(i.e. subgroups of finite index of GLjv(Z)). As a short list, 
we could mention: 



• modular forms and special values of L functions, 

• Iwasawa theory and understanding of the "cyclotomy" , 

• Galois representations (or automorphic representations). 



Let us explain first what is algebraic K-theory: to a commu- 
tative ring R we can associate (functorially) an infinite fam- 
ily of abelian groups K n (R) which encodes a huge amount 
of information on its arithmetic, geometric and algebraic 
structures. These groups extend some classical notions and 
give higher dimensional analogues of some well known re- 
sults. For instance if R is a ring of integers or a polynomial 
ring over a field, we have 



• Kq(R) is the classical Grothendieck group (classifying 
finitely generated R— modules), 

• Ki(R) is the group of invertibles of R, 

• K2{R) classifies the universal extensions of SL(R) and 
is related to the Brauer group in the case of a field. 



We can give a general abstract definition of K n for n > 0, 

K n (R) = 7v n (BGL(R)+) , 

where BGL(R) + is the Quillen +- construction applied to 
the classifying space BGL(R) and Tv n denotes the nth ho- 
motopy group [23]. We also can see the A'-groups as a way 
to understand GL(R). For instance, Ki(R) is isomorphic 
to GL(R) modulo elementary relations. For a more detailed 
background on K-theory and its applications see |23| . 

Fact: these groups are hard to compute. 



1.1.1 The Vandiver conjecture as an illustration 

The Vandiver conjecture plays an important role in the un- 
derstanding of the "cyclotomy" in number theory. Its state- 
ment is as follows; 

Conjecture of Vandiver: Let p be an odd prime, ( p — 
e 2 ' r!/ ' p , C the p-Sylow subgroup of the class group of Q(C P ) 
and C + the subgroup fixed by the complex conjugation. The 



Vandiver conjecture is the statement that C + = 0. 

To illustrate its interplay with K-theory, we have 

Fact (Kurihara, 1992) [22]: If K 4n (Z) = for all n > 0, then 

the conjecture of Vandiver is true. 

Some partial results on this conjecture and its connection 
with the cohomology of 5*Ljv(Z) are given in [25] . 
General problem: Find explicit methods for computing 
(co)homologies of arithmetic groups and the K-theory of 
number fields (or their ring of integers). 
Our first task will be the computation of the (co)homolo- 
gies of linear groups (mainly GLjv(Z) and SLjv(Z)). We 
can show [151 1131 114] that the computation of those groups 
is the key point for computing if-groups with our method. 
We will begin to recall some facts from topology. 

1.1.2 Topological Excursion: Cellular complexes, 

or how to simply describe the "combinatorial" 

structure of a topological space 
The notion of cellular complex is a generalization of a graph 
to several dimensions. We call n—cell a topological space 
homeomorphic to the open unit ball of R" and such that its 
closure is homeomorphic to the closed unit ball. A cellular 
complex (or cell complex or also cellular decomposition or 
cellular space) is a family of sets X n (with n 6 N), such 
that each X n is a collection (eventually infinite) of n— cells. 
Usually we work with cell complexes with a finite number 
of cells. 

A classical result [25] shows that any (reasonable) topological 
space can be approximated by such cell complexes. 




A cellular decomposition of the cube 

1.1.3 Computation of the homology of a cell complex 
To a cell complex, we can associate a family C„ (n £ N) 
of free Z— modules and a family d n : C' n — » C„-i of linear 
maps. 

The module C n is the free module with basis the n— cells 
(modulo a choice of orientation). 

If the complex is finite, then all the modules are of finite rank 
and we will denote by {&a}asa„ a basis of C„, A n being an 
index set for the n— cells. Then the map d n is defined by 



and the integer number [&" : is called the incidence 

number of the cell e^ _1 inside the cell ej. The relation 
d n o d n +i = (i.e., formally d 2 = 0) should hold for any n. 
If the complex is regular (i.e., always at most one cell of 
dimension n + 1 between two cells of dimension n), then 
we can build the incidences inductively starting from the 
0— cells up to the maximal cells using the d 2 — condition 
and moreover the incidence numbers will be 0, or ±1. 
The nth homology group of the complex is defined as the 
quotient of Ker(d„) by Im(d„+i). This construction is func- 
torial (in the category of cell complexes). As a consequence, 
we can determine the homology groups effectively by comput- 
ing the Smith form of the integral matrices of the d n (rela- 
tively to the fixed basis). Notice that the Smith form gives 
both the rank of the free part and the explicit description of 
the torsion part. In case the computation of the torsion is 
unnecessary (or to difficult to achieve), we can tensorize by 
— CS>z Q and the homology groups become Q-vector spaces 
with their dimensions given by the ranks of the matrices of 
the differentials. 

In general, the matrices of the differentials can be very large 
even for a relatively simple cell complex. However, they are 
also very sparse and we may look for some other favorable 
properties which would enable the computation despite the 
size of the problem. 

1.1.4 How to use such settings for the computation 

of linear groups ? 
If G is a group acting on a cell space X (i.e., G sends n— cells 
to ?i— cells), then, under some technical assumptions on X 
and on the action, we can show [4] that roughly computing 
the homology of G (as group homology) is the same as com- 
puting the homology of the cell space X/G. Hence, if X/G 
can be calculated effectively, we can compute explicitly its 
homology, and from this the homology of G (similarly for 
the cohomology). Notice that in general the space X/G will 
not be regular anymore. The main difficulty is to find a cell 
space X such that X/G will be effective. We will discuss 
in section 12.11 how we can construct such cellular space for 
linear groups. 

1.2 Parallelism motivations 

This first idea to deal with very large sparse matrices is 
to use them as blackboxes, i.e. only using the matrix-vector 
product. This will let the matrix remain sparse all along the 
algorithm where Gaussian elimination for instance would fill 
it up. To compute the rank, the fastest black box algorithm 
is Wiedemann's as shown in [10] , This algorithm computes 
a sequence of scalars of the form u t A t v (u and v are vec- 
tors) with i matrix vector products and dot products. It 
has been shown to successfully deal with large sparse ma- 
trix problems from homology, see e.g. [9]. Nevertheless, 
when matrices are very large (e.g millions on non-zero en- 
tries) computations would require months or years. This is 
due to the low practical efficiency of the computation of a 
sparse matrix- vector product. For instance, in our case of 
homology computation, one would need 300 days of CPU 
to compute the sequence involving matrix of GLt(/L) with 
n = 19 (GL7dl9 matrix). To achieve computations of many 
homologies in a realistic time we then need to parallelize the 
computation of the sequence. Then the algorithm candidate 



is the block Wiedemann method, which computes a sequence 
X T A i Y where X and Y are blocks of vectors. This step can 
be easily parallelized by distributing vectors of block Y to 
several processors. We thus have several objectives with re- 
gard to the parallelism in this paper: 

• We want to solve large problems coming from homol- 
ogy computation. 

• We want to experimentally validate our parallel imple- 
mentation of the block Wiedemann rank algorithm. 

• We want to show the parallel scaling of block Wiedemann 
approaches. 

1.3 Summary of the paper 

In section [2] the algorithms and optimizations we used to 
generate the matrices and compute with them are discussed. 
Then, section [3] shows our experimental results on these 
large sparse matrices coming from homology. 

2. ALGORITHMS AND OPTIMIZATIONS 

As seen in section [TTT]3j we can effectively compute the ho- 
mology of a cellular space and of a group which acts "nicely" 
on a cellular space. The main difficulty remains to find such 
explicit cellular space. In section [2~T1 we present the process 
of matrix generation and optimization. Then in the follow- 
ing sections we give a description of the algorithms used for 
the computation of the rank and the Smith form of those 
matrices. 

2.1 Matrices generation 

In the case of subgroups of GLjv(Z), we have an "obvious" 
action on Z N . We can then capture the topology by regard- 
ing 7L N not as a free Z— module but as a lattice (or equiv- 
alently as quadratic forms), and see if this leads to some 
interesting topological construction. We will describe this 
approach below and the results that we can get. 

2.1.1 Voronoi's reduction theory 

Let N ^ 2 be an integer. We let Cn be the set of positive 
definite real quadratic forms in N variables. Given h G Cn, 
let m(h) be the finite set of minimal vectors of h, i.e. vectors 
v G Z , v 7^ 0, such that h(v) is minimal. A form h is called 
perfect when m(h) determines h up to scalar: if h' G Cn is 
such that m(h!) = m(h), then h! is proportional to h. 

Example 2.1. The form q(x,y) = x 2 +y 2 has minimum 1 
and minimal vectors ±(1,0) and ±(0,1). Nevertheless this 
form is not perfect, because there is an infinite number of 
definite positive quadratic forms having these minimal vec- 
tors. 

On the other hand, the form q(x, y) = x 2 + xy + y 2 has also 
minimum 1 and has exactly 3 minimal vectors (up to sign), 
the one above and ±(1, —1). This form is perfect, the associ- 
ated lattice is the "honeycomb lattice" (with optimal spheres 
packing in the plane), it is the only one. 

Denote by C N the set of non negative real quadratic forms 
on 1^ the kernel of which is spanned by a proper linear 
subspace of Q , by X N the quotient of C N by positive real 
homotheties, and by n : C N — » X N the projection. Let 



Xn = 7r(Cjv) and dX N = X N — Xn- Let F be either 
GLjv(Z) or SLjv(Z). The group F acts on C N and X N 
on the right by the formula 

h-j = 7*/l7, 7 £ r , h £ C N , 

where h is viewed as a symmetric matrix and 7* is the trans- 
posed of the matrix 7. Vorono'f proved that there are only 
finitely many perfect forms modulo the action of F and mul- 
tiplication by positive real numbers ([22]> Th. p. 110). 
Given v € Z N - {0} we let v G C N be the form defined by 

v(x) = (v I x) 2 , iff , 

where (v \ x) is the scalar product of v and x. The convex 
hull of a finite subset B C Z N — {0} is the subset of X N 
image by tt of the elements J^Aj vj, Vj € B, Xj ^ 0. For 

j 

any perfect form h, we let o(h) C X N be the convex hull of 
the set m(h) of its minimal vectors. Vorono'f proved in |32l 
§ 8-15], that the cells o(h) and their intersections, as h runs 
over all perfect forms, define a cell decomposition of X N , 
which is invariant by the action of F. We endow X N with 
the corresponding CW-topology. If r is a closed cell in X N 
and h a perfect form with r C o(h), we let m(r) be the set 
of vectors v in m(h) such that v lies in r. Any closed cell r 
is the convex hull of m(r) and m(r) n m(r') = m(r PI r'). 

2.1.2 Voronoi's complex 

Let d(N) = N(N + l)/2 - 1 be the dimension of X* N and 
n ^ d(N) a natural integer. We denote by E* a set of 
representatives, modulo the action of F, of those cells of 
dimension n in X N which meet Xn, and by E n C E* the 
cells a such that the stabilizer r CT of a in F preserves its 
orientation. Let V n be the free abelian group generated by 
E n . We define as follows a map 

d„ : Vn —> Vn—l ■ 

For each closed cell o in X N we fix an orientation of a, i.e. 
an orientation of the real vector space R(cr) of symmetric 
matrices spanned by the forms v, v G m(a). Let a £ E n 
and let r' be a face of o. Given a positive basis B' of R(t') 
we get a basis B of R(cr) by adding after B' a vector v, 
v G m(a) — m(r'). We let e(r',cr) = ±1 be the sign of the 
orientation of B in the oriented vector space M(cr) (this sign 
does not depend on the choice of v). 

Next, let r G E n _i be the cell equivalent to r 1 = r • 7. We 
define T](t,t') = 1 (resp. t](t,t') = —1) when 7 is com- 
patible (resp. incompatible) with the chosen orientations of 
R(t) and R(r'). 

Finally we define 

dn(cr)= J2"0(t-,t')e(t',o)t, (1) 

tES„_! t' 

where t' runs through the set of faces of o which are equiv- 
alent to r. 

It is shown in [14], that up to p-torsions with p ^ N + 1, the 
homology of this complex computes the cohomology of G. 
For N — 5, 6, 7 we get the following results for E n . 
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Figure 1: Cardinalities of E„ and E* for N = 5,6 (empty slots denote zero) 
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Figure 2: Cardinalities of En and E n for N = 7 



Theorem 2.2. (Elbaz-Vincent/Gangl/Soule)lT5lUMW. 
The cardinalities of E„ and EJj are shown on figure \7\ for 
N = 5, 6 and on figure^ for N — 7. 

The previous result gives the precise size of the matrices 
involved in the computation of the homology. 

The main challenge was then the computation of the ranks of 
the matrices of the differential (this gives the free part of the 
homology) and the computation of the Smith forms (which 
gives the relevant arithmetical information of the homology), 
in particular for N — 7, knowing that such matrices are 
particularly sparse. We can emphasize the fact that what 
we want to detect is the "high torsion" in the homology (i.e. 
the prime divisors > 7 of the Smith invariants). 

In the following paragraphs we will discuss the different 
methods chosen for the computations and to take up the 
challenge^- 

2.2 Coppersmith Block Wiedemann 

One successful approach to deal with linear algebra compu- 
tations on large sparse matrices is to rely on Lanczos/Krylov 
black-box methods. In particular, block versions of Wiede- 
mann method [33] are well suited for parallel computation. 
This technique has been first proposed by Coppersmith in [6] 
for computation over GF(2) and then analyzed and proved 
by Kaltofen [19] and Villard [30] [31]. The idea is for a 
matrix A G p nxn to compute the minimal generating ma- 
trix polynomial of the matrix sequence {X A l Y}'*L € P xs , 

: A11 the matrices are available on line 
in the "Sparse Integer Matrix Collection" 
(ljk. imag . f r/membres/ Jean-Guillaume .Dumas/simc .html ) 



where X, Y are blocks of s vectors (instead of vectors in the 
original Wiedemann's algorithm). Therefore, an intuitive 
parallelization of this method is to distribute the vectors of 
the block X, Y to several processors. Thus, the computation 
of the sequence, which is the major performance bottleneck 
of this method, can be done in parallel and then allow for 
better performance. 

Lots of implementations and practical experimentations has 
been developed on parallel block Wiedemann. For instance, 
in 1996, Kaltofen and Lobo [20] have proposed a coarse grain 
implementation to solve homogeneous linear equations over 
GF(2). They have thus been capable to solve a system of 
252 252 linear equations with about 11.04 million non-zero 
entries, in about 26.5 hours using 4 processors of an SP-2 
multiprocessor. 

Lately, in 2001, Thome in [27] improved Coppersmith's al- 
gorithm by introducing matrix half-gcd's computation, and 
its implementation [28] was able to outperform Kaltofen- 
Lobo's software. One may remark that introduction of ma- 
trix gcd was first suggested by Villard in [30] who relied on 
the work of Beckermann and Labahn [2] on power Hermite 
Pade approximation. Finally, Giorgi, Jeannerod and Vil- 
lard have generalized in [16] block Wiedemann algorithms 
by introducing cr-basis computation and then reducing the 
complexity to polynomial matrix multiplication. A sequen- 
tial implementation of this algorithm is now available in the 
LinBox library (www.linalg.org). 

2.3 Block symmetry 

In order to reduce the number of dot products, we used a 
symmetric projection. In other words, we set X = Y T in 
the XA % Y sequence. Indeed, in this case the probability of 
success is reduced but the obtained block is symmetric as 



soon as A is symmetric. This is always the case when the 
preconditioners of [10] are used (they are of the form A T A). 
This reduces the dot product part of the computation of the 
sequence by a factor of two. For instance column i can be 
deduced from its top i elements and row i. This induces 
some load balancing issues when one process owns the com- 
putation of one column. Note also that we use BLAS level-2 
for the computation of this dot products. In other words we 
perform them by blocks. 

2.4 a-basis computations 

In order to efficiently compute cr-basis we rely on algorithm 
PM- Basis of [16] which reduces this computation to poly- 
nomial matrix multiplication. One can multiply two poly- 
nomial matrices A,B G F nxn [x] of degree d in 0(n 3 d + 
n 2 dlogd) finite field operations if d-th primitive roots of 
unity are available in F. Consequently, we decided for our 
computations to define F as a prime field with primes of the 
form c x 2 fc + 1 such that c x 2 k = mod d. These primes 
are commonly called FFT primes since they allow the use 
of FFT evaluation/interpolation for polynomials of degree 
d. We refer the reader to [5j [3] and references therein for 
further informations on fast polynomial matrix arithmetic. 

When finite fields not having d-th primitive roots of unity 
are used, polynomial matrix multiplication is still be done 
efficiently by using Chinese Remainder Theorem with few 
FFT primes. Let be F a prime field of cardinality p, then 
the multiplication of A, B £ F nxn [a;] of degree d can be ef- 
ficiently done by using CRT with FFT primes pi satisfying 
Q Pi > d x n x p 2 . This is equivalent to perform the multi- 
plication over the integers and then reduce the result in F. 
The overall performance of the multiplication, and then of 
the a-basis , is dependent on the numbers of FFT primes 
needed. 

2.5 Rank 

Our main interest in the block Wiedemann approach is to 
compute the rank of large sparse matrices given by the Ho- 
mology group determination problem explained in section 
11.11 Hence, we rely on Kaltofen-Saunder's rank algorithm 
[21] and its block version [29] to achieve efficient parallel 
computation. 

The Kaltofen-Saunders approach is based on the fact that 
if A is a good preconditioned matrix of A then its rank 
is equal to the degree of its minimal polynomial minus its 
valuation (or co-degree) [2T]. Thus, by using well chosen 
preconditioners and Wiedemann algorithm one can easily 
compute the rank of a sparse matrix over a finite field. The 
block version of this method is presented e.g. in |29l §4]. 
We recall now the basic outline of this algorithm. 

Block Wiedemann Rank Algorithm : 
let A € F nxn , 

1 form A from A with good preconditioners (e.g. those 

of us]). 

2 choose random block Y £ f nxs and compute the 

matrix sequence S = {Y T A l Y} for i = . . . 2n/s + O(l). 

3 compute the minimal matrix generator Fy € F sxs [x] 
of the matrix sequence S. 



4 return the rank r as r = deg(det(Fy ))— codeg det(Fy). 

Note that if the minimal matrix of step 3 is in Popov form 
(e.g. computed using the cr-basis of [3]), then the degree of 
det(_Fy ) is simply the sum of the row degrees of the matrix 
Fy. Then the co-de gree is zero if the determinant of the 
constant term of Fy, seen as a matrix polynomial, is non- 
zero. In the latter case the computation of the determinant 
of the whole polynomial matrix can be avoided. 

When this fails, this determinant is computed by a mas- 
sively parallel evaluation/interpolation. It could be inter- 
esting, though, to interpolate only the lower coefficients of 
this polynomial incrementally. This was not required for the 
matrices we considered and we therefore did not investigate 
more on these speed improvements. 

Note that to probabilistically compute the rank over the in- 
tegers, it is sufficient to choose several primes at random 
and take the largest obtained value, see e.g. [9] for more 
details. Moreover, one can choose the primes at random 
among primes not dividing the determinant (and thus pre- 
serving the rank). In order to ensure this property it it 
sufficient to select primes not dividing the valence or last 
invariant factor computed by one of the methods of next 
section. 

2.6 Smith form 

The computation of the Smith form for the matrices of 
GL-r(Z) turned out to be a very challenging problem. 

2.6.1 Smith for via the Valence 

Prior experience with sparse homology matrices led us to 
try the SmithVia Valence algorithm of [9]. The idea is to 
compute the minimal valence (the coefficient of the small- 
est non zero monomial of the minimal polynomial) of the 
product A T A to determine the primes p which divide the 
invariant factors of the Smith form of A. When the primes 
have been found, one can compute the local Smith forms of 
A at each p separately and return the resulting Smith form 
S as the product of the local Smith forms S p over all p. 
Local Smith form computation can be done by a repeated 
Gauss elimination modulo p e where the exponent e is ad- 
justed automatically during the course of the algorithm. 

This algorithm works very efficiently for sparse matrices pro- 
vided that the minimal polynomial of the product AA T has 
a small degree. Unfortunately, the latter condition does not 
hold in the case of GLy(Z) matrices. Moreover, some early 
experiments with small matrices showed that much more 
primes occur in the computed valence than in the Smith 
form of the original matrix. 

2.6.2 Saunders and Wan 's adaptive algorithm 
Thus, we decided to apply the adaptive algorithm of Saun- 
ders and Wan [22] which is a modified version of Eberly- 
Giesbrecht-Villard algorithm [12]. In [12| the authors pro- 
posed a procedure OneInvariantFactor(i, A) (OIF) which 
computes the ith invariant factor of a n x n matrix A. Then 
the binary search for distinct invariant factors allows them 
to find the Smith form of A. OIF reduces the ith factor 



computation to the computation of the last (nth) invariant 
factor (LIF) of a preconditioned matrix A + UiVi, where 
Ui,V? are random n x (n — i) matrices. In the method 
was extended to handle the rectangular case ofraxn ma- 
trix. It is done by computing the last (ith) invariant factor 
of a preconditioned matrix LiARi where L; is a m x i and 
Ri is a i x n matrix. 

The procedure OIF is of Monte Carlo probabilistic type 
where the probability of correctness is controlled by repeat- 
ing the choice of preconditioners. Assuming the correctness 
of LIF computation, it gives a multiple of the ith invariant 
factor. In practice, LIF is also of randomized Monte Carlo 
type. The idea is to get a divisor of the last invariant factor 
by solving a linear equation Mx = b with random right- 
hand side. After several solvings we get the last invariant 
factor with large probability. Thus, the overall situation is 
more complex and we cannot exclude the possibility that 
some primes are omitted or unnecessary in the output of 
OIF. However, the probability that a prime is omitted or is 
unnecessary in this output can be controlled for each prime 
separately and is smaller for bigger primes. 

Therefore in [21] the authors introduce a notion of smooth 
and rough parts of the Smith form. The idea is to compute 
the local Smith form for smaller primes by for example the 
SmithVia Valence or OIF algorithm and to recover only large 
primes with the invariant factor search of [12] , When we 
consider large primes, a sufficient probability of correctness 
can be obtained by a smaller number of repetitions. 

2.6.3 More adaptiveness 

As we did not want to compute the valence, we introduced 
some minor changes to the algorithm, which at the end is 
as follows: 

1. r = rank(^4) 

2. For primes 1 < p < 100 compute the local Smith form 
S p of A; 

3. Compute s r (A) by OnelnvariantFactor algorithm; 

4. P = all primes p > 100 which divide s r (A); 

5. IfP = return 5" = US P ; 

p 

One advantage of this method is that we get the information 
on the smooth form of the matrix very quickly. Moreover, 
the OIF computation acts as a certification phase which 
allows us to prove that no other primes are present with a 
sufficiently large probability. This probability is explicit in 
the following theorem: 

Theorem 2.3. The probability that there exists a prime 
p > P that divides the ith invariant factor but does not divide 
the output of OIF which uses M random preconditioners 
Li, Ri and N random vectors b, with b € {0, 1, . . . /3— 1} and 
f3 > Si(A), in the LIF procedure is bounded by 




Proof. As we take the gcd of the result with different 
preconditioners Li and Ri , is suffices that the LIF computa- 
tion fails in one case to spoil the computation. We are free 
to choose a large bound for ||fe|| such that > Si(A) with- 
out increasing the complexity of LIF computation. Then 
the probability that a prime p < ||&|| is omitted in LIF is 
less than or equal to ^["^1 < |, see [T]. Finally, we bound 
the probability that any prime p > P, p \ Si (A) is missing by 
taking a sum over all primes. □ 

The choice M = N = 2 suffice to obtain a small probability 
0, 015 of omitting an important prime, and at the same time 
to exclude all primes that are not in the ith invariant factor. 
In our experiments, there was no need to perform the com- 
putation for any additional prime p > 100 as all the primes 
were excluded by the OIF computation. This is one of the 
most important advantages over the valence computation. 

The algorithm [24| was stated in the case of dense matrices. 
We slightly modified it in order to exploit the sparse struc- 
ture of the matrix. In particular, we used the sparse local 
Smith form computation of [9j Algorithm LRE] but stick to 
the dense Dixon solver [7] as long as the memory was suffi- 
cient. Any other solver, including the new sparse solver of 
[11] could potentially be used for larger matrices. 

The limits of this method are imposed by the available mem- 
ory. For example, it was possible to use the dense Dixon 
solver only for the six smallest matrices. Furthermore, sparse 
elimination reached its limits for matrices of size greater 
than 171375 x 47271 and 21074 x 105054 when the filling 
of the matrices started to be impossible to handle. For the 
460261 x 171375 matrix GL7dl5 and 105054x349443 matrix 
GL7d23, specialized space-efficient elimination procedures 
mod 2, 3 and 5 allowed us to compute the rank mod 2, 3 
and 5 respectively. 

2.6.4 Chain Reductions 

The encountered problems have shown a need for a more 
elaborated reduction algorithm. We focused our attention 
on the algebraic reduction algorithm for chain complexes of 
[18] . We implemented a simplified version of the algorithm 
in the language of matrices using the LinBox library. The 
heuristic behind this algorithm is that Gaussian elimination 
can propagate from one matrix of a chain complex to the 
next thanks to the exactness of the differential map (i.e. 
d 2 — condition). 

The motivations come from the geometric properties of ho- 
mologies. By a free face we refer to a (k — 1) cell a which 
is in the differential of only one fc-dimensional cell b. By 
removing the pair (a, b) we obtain a retract of the initial 
cell complex (viewed as a geometrical object), see Figure 
131 The process can be repeated. From the homology theory 
we know that the groups of homology are the same for the 
set and its retract. The removal of pairs leads to a reduction 
of the basis of the cell complex. We refer to [171 Ch.4] for a 
full description of the procedure. 

If the differential map is represented by matrices whose rows 
represent the cells of dimension k — 1 and columns - dimen- 
sion k, the removal of a pair (a, b) can be interpreted as the 
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Figure 3: Retraction for a square 



removal of a row with only one non-zero entry (1-row) and 
the column it points to. In the general case, the algebraic 
reduction of (a, b) such that dk(b) — Xa + u is possible iff A 
is invertible in the ring of computation i.e. Q, Z, Z p t - de- 
pending on the problem. A modification d of the differential 
given by the formula 



d k v = d k (v) - A 1 [v : a]d k (b). 



(2) 



Thus, in the basic case of free face removal no modification 
is needed. In the case of matrices, the formula describes a 
step of Gauss elimination where the reduced row is removed 
and not permuted. This proves that the Smith form (or the 
rank) of the initial and reduced matrix will be the same, 
provided we add a number of trivial invariant factors equal 
to the number of rows reduced to the reduced Smith form. 

The important characteristic of this methods is that, thanks 
to the exactness of the matrix sequence, we can also remove 
row b and column a from the neighboring matrices. In this 
way, elimination in one matrix can propagate on the others. 

Due to the format of data (large files with matrix entries) 
we decided to implement only the simplest case of 1-rows 
removal which led to entries removal but no modifications. 
We removed the empty rows/columns at the same time and 
performed the whole reduction phase at the moment of read- 
ing the files. This led to vast matrix reduction in the case of 
GL-r(Z) matrices from the beginning of the sequence. The 
propagation of reductions unfortunately burned out near 
GL7dl4 matrix and stopped completely on GL7dl9. Ap- 
plying the process for the transposed sequence did not im- 
prove the solution. Next step would be to implement the 
propagation of Gauss elimination steps as in Eq. ((2J). It 
would be interesting to examine whether the burn-out can 
be connected to the loss of regularity for X/G and/or a huge 
rectangularity of the input matrix GLldlA. 

3. EXPERIMENTAL RESULTS 
3.1 Parallel implementation 

For the sake of simplicity in our experimental validation, 
we decided to develop our parallel implementation using 
shell tasks distribution on SMP architecture. Thus, a sim- 
ple script code is used to distribute all different tasks over 
all the processors and files are used to gather up the com- 
puted results. Our parallel implementation has been done 
as follow: 



1. The block of vectors Y and the sparse matrix A are 
broadcasted on every processors. 



2. Each process takes one column of the blackbox A .A 
and compute the corresponding column's sequence us- 
ing the first ith columns of Y. Each process writes the 
result in a file labeled with the corresponding index of 
the column sequence. 

3. When all previous processes have terminated, the cr- 
basis computation is sequentially performed after load- 
ing the sequence from the generated files. 

Despite the naive approach used for the parallelization, our 
implementation authorized us to perform very large compu- 
tations as show in next section. However, our experiments 
show a need for at least a more robust parallel computation 
scheduler. 

3.2 Rank and Smith form 

All our computation have been done on a SGI Altix 3700 
gathering 64 Itanium2 processors with 192Gb memory and 
running SuSE Linux Enterprise System 10. Further informa- 



tions on this platform are available at |http :/ /www . math . uwaterloo . ca/mf 

In Table[T]we include the information about the dimensions 
of the GLr(Z) matrices and their sparsity. The matrices are 
very sparse which is illustrated by the fact that less than 
1% of the entries are non-zero except for matrices GL7dlO 
and GL7dll. This value drops to less than 0,2% in the case 
of the largest matrices. Also in Table [T] we give the results 
for the rank and the Smith form computations. We have 
obtained a full information on the rank of GLt(X) matri- 
ces. For the computation of the Smith form, full result has 
been obtained in the case of matrices 10,11,12,13,25,26. For 
matrices 14 and 24 only the smooth part of the Smith form 
has been computed. For matrices 15 and 23 we have proved 
the existence of a non-trivial local Smith form at 2 and 3 
and a triviality of the local Smith form at 5. As the result 
for these matrices we give the number of invariant factors 
divisible by 2 and 3. 

In Table[2]we give the times for the Smith form algorithms 
used. For cases with * no data are available or relevant. 

The rank computation for GL-ji^L) matrices was performed 
modulo 65537. This FFT prime allowed us to use both 
BLAS routines and sigma-basis reconstruction using fast 
polynomial multiplication. In Table [3] we give the timings 
for different operations used in the computation i.e. the 
sparse matrix- vector product, BLAS-based matrix- vector prod- 
uct and, for the sake of comparison, the time of scalar dot 
product equivalent to the BLAS computation. 

In Table [4] we give the estimation of sequential and paral- 
lel cpu time of rank computation and compare it with the 
real time of parallel computation. The times are estimated 
based on the number of iterations and the times of one step 
which can be computed from Table [3] (notice, that in the 
scalar case we use 1 dot product instead of BLAS). The real 
time of computation includes the time of writing and read- 
ing the data which was considerable. The difference of the 
real and estimated running times may also be due to the 
overload of the computation cluster. Moreover, long com- 
putations suffered from system crashes and/or shutdowns. 
Some restoration scripts were used to recover the data which 
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Table 1: Results of the rank and Smith form computation for GLt(Z) matrix A of dimension n x m with fl 
non-zero entries. For (*) the information is incomplete - only divisors of the invariant factors were determined 
based on the rank mod 2 and 3 computation. 
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Table 2: Times for Smith Form computation for GLt{'L) matrices. From left to right: the dimensions of the 
matrix after reductions, rank approximation by reductions, time of reading and reducing the matrix, time 
for the adaptive algorithm for a reduced matrix; times for the original matrix: smooth form computation, 
adaptive algorithm; valence computation in parallel - sequential time equivalent. 
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Table 4: A summary of large-scale parallel rank computation. From left to right: number of iteration in the 
scalar case, time estimation in this case, number of iterations on p processors computed as 2 + 2 ■ r/p, average 
(real) time of sequence computation, estimated time on p processors, the time of the a basis computation for 
a sequence of length iter [p] of p x p matrices. 
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Table 3: CPU timings (in sec.) for different opera- 
tions used in large-scale parallel rank computation. 
All times in seconds. From left to right: time of 
a matrix-vector product, a BLAS multiplication of 
a vector and a 30 x min(n, m) matrix U and 30 dot 
products. 



unfortunately required re-running some part of the compu- 
tation. Thus, the real time given in Table UCol.5] should 
be treated as a rough approximation. 

4. CONCLUSION 

Using the previous methods and computations, we get the 
following new result for the rational cohomology of GLj(Z). 



Theorem 4.1. (Elbaz-Vincent/Gangl/SouleJlT^I We have 
H m (GL 7 (Z),Q) 



Q if m = 0, 5, 11, 14, 15, 
otherwise. 



Clearly the simple parallelization we used together with the 
highly optimized routines were the key to enable these com- 
putations. 

To go further and solve even larger problems, it is manda- 
tory to improve the parallelism. On SMP we can split the 
matrices into blocks and perform the matrix-vector prod- 
ucts with different threads. This, and the unbalanced load 
we had when we choose to assign one vector to one process, 
advocates for the use of more advanced scheduling. We are 
experimenting KAAPfl but were not ready for the compu- 
tation of GLj. 

Other improvements are of algorithmic type. They include 
the use of the sparse projections of [11] for the matrix se- 
quence. But then we loose the symmetry of the projections 
and therefore must pay a factor of two for the number of 
iterations. We could also use an early termination strategy 
to stop the iteration earlier, but up to now this require to 
loose the fast algorithm for the sigma bases. Then if a good 
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structure for the sparsity of the matrices could be found, 
e.g. an adapted reordering technique, this would enable an 
efficient clustering and therefore faster and more scalable 
matrix-vector products. 

In order to have the relevant part of the torsion of the in- 
tegral cohomology of GLr(Z), we would need the complete 
description of the Smith forms of all the matrices described 
above. Our experiments have shown that this can be an 
enormously difficult task. The computation remains to be 
done but would have applications in number theory. 
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